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ESTIMATING TIME- VARYING NETWORKS 

By Mladen Kolar, Le Song, Amr Ahmed and Eric P. Xing * 
School of Computer Science, Carnegie Mellon University 

Stochastic networks are a plausible representation of the rela- 
tional information among entities in dynamic systems such as living 
cells or social communities. While there is a rich literature in es- 
timating a static or temporally invariant network from observation 
data, little has been done towards estimating time-varying networks 
from time series of entity attributes. In this paper, we present two 
new machine learning methods for estimating time- varying networks, 
which both build on a temporally smoothed /i -regularized logistic re- 
gression formalism that can be cast as standard convex-optimization 
problem and solved efficiently using generic solvers scalable to large 
networks. We report promising results on recovering simulated time- 
varying networks. For real datasets, we reverse engineer the latent 
sequence of temporally rewiring political network between Senators 
from the US senate voting records and the latent evolving gene net- 
work which contains more than 4000 genes from the life cycle of 
Drosophila melanog aster from microarray time course. 

1. Introduction. Networks are a fundamental form of representation 
of information. In many problems arising in biology, social sciences and 
various other fields, it is often necessary to analyze populations of enti- 
ties such as molecules or individuals, interconnected by a set of relation- 
ships (e.g., physical interactions or functional regulations in biological con- 
texts, and friendship, communication, etc. in social contexts). Studying net- 
works of these kinds can reveal a wide range of information, such as how 
molecules/individuals organize themselves into groups; which molecules are 
the key regulator or witch individuals are in positions of power; and how the 
patterns of biological regulations or social interactions are likely to evolve 
over time. While there is a rich (and growing) literature on modeling a static 
network at a single point in time, or time-invariant networks, much less has 
been done toward modeling the dynamical processes underlying networks 
that are rewiring over time; and on developing learning techniques for re- 
covering unobserved networks from time series of entity attributes. This 
structure estimation problem is especially important in the domains with 
very little prior knowledge about the existing interactions and relationships 
between actors. Once estimated, the structure of the network can be an- 
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alyzed by an expert to obtain a better understanding of the underlying 
processes in nature. Many existing procedures for the structure estimation 
assume that the structure is static and that the observed data is indepen- 
dent and identicahy distributed. The assumption of the structure stabihty 
can be seen as inappropriate in, for example, analysis of stock prices where 
interactions between stocks change over time, or a biological experiment that 
measures gene expression levels during the developmental cycle of an organ- 
ism. The focus of this paper is on the structure estimation of networks that 
evolve over time, and we restricted our analysis to undirected structures. 

Formally, the relationships between actors in a given network can be con- 
cisely described by a graph G{V,E), in which a node a ^ V denotes an 
actor; and an edge, (a, b) G denotes a unidirectional (e.g., a contacted b) 
or symmetric (e.g., a and b are friends) relationship between a pair of ac- 
tors in a complex system. Markov random fields (MRF) are an appropriate 
probabilistic model to represent these interactions. A Markov random field 
is specified by an undirected graph G = {V, E), where V = {1, . . . ,p} is a 
vertex set and E C V x V is diU edge set. Let X = (Xi, . . . , Xp)^ be the 
/^-dimensional random vector whose distribution P can be represented by 
the graph G. Each vertex from the set V is associated with one component 
of the random vector X. The edge set E of the graph G encodes certain 
conditional independence assumptions among subsets of the p-dimensional 
random vector X; is conditionally independent of Xj given the other 
variables if (z, j) ^ E. 

The fundamental problem we are going to address is that of a graphic 
structure estimation: given a sample of size n from Markov random 
field, estimate the structure of the underlying graph. In this work, we will 
discard the assumption that the sample is independent and identically dis- 
tributed (i.i.d.), and assume only that it is independent, i.e., instances can be 
drawn from possibly different MRF distributions. This relaxed assumption 
will allow us to estimate graphs with time varying structure. 

There has been a great amount of work done in estimating graphs with 
static structure. Assume for the moment that we are given an indepen- 
dent and identically distributed sample of n data points = {x^^^ = 
{xi \ . . . ,Xp'^)}f^i from P. Under the assumption that X is a multivariate 
normal distribution with mean vector jn and covariance matrix S, estima- 
tion of the graph structure is equivalent to the estimation of zeros in the 
inverse covariance matrix = S"^ (Lauritzen, 1996), commonly referred to 
as the concentration matrix. Drton and Perlman (2004) proposed a method 
that tests for partial correlations that are not significantly different from 
zero, which can be applied when the number of dimensions p is small in 
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comparison to the sample size n. In recent years, there has been a surge 
of data sets that are high dimensional with few samples, e.g. data coming 
from microarray experiments and fMRI data. These "large p, small n" data 
sets pose a difficult estimation problem, but under the assumption that the 
graph is sparse and the data is multivariate normal, several methods can be 
employed successfully for structure recovery. Meinshausen and Biihlmann 
(2006) proposed a procedure based on neighborhood selection of each node 
via the ii penalized regression. Their procedure selects the model consis- 
tently under a set of suitable conditions, of which the most important one 
is commonly known in the literature as the irrepresentable condition (Mein- 
shausen and Biihlmann, 2006; Wainwright, 2006; Zhao and Yu, 2006). Peng 
et al. (2008) propose a different neighborhood selection procedure for the 
structure etimation in which they estimate all neighborhoods jointly. The 
joint estimation procedure gives the global structure and improves on the 
neighborhood selection on a variety of networks. These methods based on 
the neighborhood selection are suitable for large-scale problems due to avail- 
ability of fast solvers to ii penalized problems (Efron et al., 2004; Friedman 
et al., 2007a). 

Another popular approach to the structure estimation is the ii penalized 
likelihood, which achieves simultaneously model selection and parameter es- 
timation. The penalized likelihood approach is much more computationally 
intensive since finding the estimate involves solving a semidefinite program 
(SDP) and a number of authors have worked on finding a more efficient way 
to solve the problem (Banerjee et al., 2008; Yuan and Lin, 2007; Friedman 
et al., 2007b; Duchi et al., 2008a; Rothman et al., 2008). Of these meth- 
ods, it seems that the graphical lasso (Friedman et al., 2007b) is the most 
computationally efficient. Other than the £i penalized likelihood estimation, 
some authors have proposed to use a non-concave penalty instead (Fan and 
Li, 2001; Fan et al., 2008; Zou and Li, 2008). This non-concave penalty tries 
to remedy the bias that the ii penalty introduces. 

If the random variable X is discrete, the problem of structure estima- 
tion becomes even more difficult since the likelihood cannot be optimized 
efficiently due to the problem of evaluation of the log-partition function. 
Wainwright et al. (2006) use a pseudo-likelihood approach, based on the lo- 
cal conditional likelihood at each node, similar to the neighborhood selection 
method of Meinshausen and Biihlmann (2006). This procedure results in a 
consistent neighborhood selection of each node, and hence in the consistent 
structure estimation. 

On the other hand, with few exceptions (Hanneke and Xing, 2006; Sarkar 
and Moore, 2006; Guo et al., 2007; Zhou et al., 2008), much less has been 
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done towards modeling dynamical processes that guide topological rewiring 
and semantical evolution of networks over time. Efficient techniques need to 
be developed for recovering unobserved network topologies from observed at- 
tributes of entities constituting the network. Hanneke and Xing (2006) pro- 
posed a new class of models, called temporal Exponentially Random Graph 
Models (tERGMs), for modelling networks evolving over discrete time steps. 
tERGM uses statistics like "edge-stability", "reciprocity", "density", and 
"transitivity" of time- adjacent graphs to construct a log-linear graph transi- 
tion model P{G^\G^~^) that captures graph rewiring dynamics. Based on the 
tERGM, a hidden tERGM (htERGM) can be formulated to impose stochas- 
tic constraints on latent rewiring graphs. Similar to HMM, in principle one 
can infer time-specific network topology from the posterior distribution of 
given the time series of node attributes under htERGM (Guo et al., 
2007). However, even though the htERGM is quite expressive, a Gibbs sam- 
pling algorithm for posterior inference is very inefficient and scales only to 
a few tens of nodes; developing a tractable algorithm for inference in the 
htERGM remains an open problem. Moreover, Zhou et al. (2008) develop a 
nonparametric method for estimation of the concentration matrix that varies 
smoothly over time. Their method produces a consistent estimate with the 




under a set of suitable assumptions. 



With the avalanche of high-dimensional time-series data such as genome- 
wide gene expression time courses and large-scale social media streams, a 
key technical hurdle that prevents us from in-depth investigations of the 
mechanisms underlying these data, such as the dynamic gene regulation 
in developing organisms or the emergence/disingetration of cryptic liaisons 
and communities in society, is the unavailability of serial snapshots of the 
rewiring network during the progression of the process in question. Indeed, it 
is often technically impossible to determine time-specific network topologies 
in many natural or socio-cultural systems, and the only remaining option 
is to resort to computational-based statistical inference. With the current 
results reviewed above, there is an apparent gap between imminent method- 
ological needs and the available data. In this paper, we narrow this gap by 
presenting two practical algorithms for estimating time- varying graphs. 

The paper is organized as follows. In Section 2 we describe the proposed 
models for estimation of the time varying graphical structures and the al- 
gorithms for obtaining the estimators. In Section 3, the performance of the 
methods is demonstrated through simulation studies. In Section 4, the meth- 
ods are applied to some real world data sets. In Section 5, we discuss some 
theoretical properties of the algorithms, however, the details are left for a 
separate paper. Discussion is given in Section 6. 
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2. Methods. Let ~ be a p dimensional random vector dis- 
tributed according to some distribution P^. We will assume that the dis- 
tribution changes with the time index t and for convenience we choose to 
index time as = {1/n, 2/n, ...,1}. Given a sample of independent ob- 
servations = {X^ ^ I t G T^}, the task is to estimate the undirected 
graph = (V^E^) corresponding to the distribution P^. Primarily, we will 
be interested in estimation of networks from "large p, small n" data sets^. 
The estimation problem is ill-posed, since there are more parameters than 
samples. To deal with the problem, we will assume that the true networks 
are sparse, or, at least, that the main interactions inside the network can 
be approximated with a sparse model. In many real data sets this sparsity 
assumption holds quite well. For example, in a genetic network, rarely a reg- 
ulator gene would control more than a handful of regulatees under a specific 
condition (Davidson, 2001). 

The estimation of the time varying networks is considered in two scenarios, 
which we formally define later: (a) the distribution P^ changes smoothly 
over time, (b) the distribution P^ is constant over periods of time with some 
sudden sharp changes. For the case (a), where the distribution P^ changes 
smoothly over time, we will use a nonparametric kernel estimate. In the 
case (b), the problem is to detect change points, the time points where the 
network exibits sharp changes and estimate the network between change 
points. The case (b) is known as estimation with structural changes. We 
propose to use a total variation penalty which enforces similarity between 
parameters that are close in time. The total variation penalty has been 
successfully applied in detecting sharp transitions in signal processing, image 
denoising and nonparametric estimation, e.g Rudin et al. (1992); Mammen 
and Geer (1997); Davies and Kovac (2001). It has also been successfully 
applied to model selection in linear regression to identify blocks of nonzero 
coefficients Tibshirani et al. (2005); Friedman et al. (2007a). 

In particular, we will be concerned with estimation of graphs from discrete 
data. This encompasses a large class of problems in practice. For instance, 
in social science, discrete data, such as movie rating data and senate voting 
data, are prevalent. Or in neural science, the sequence of action potentials 
from neurons can be view as discrete pulses. Or in biological science, mea- 
surements from microarray experiments are discretized to robustly capture 
the qualitative activity of the genes. 

Formally, let be a random vector taking its values in { — 1,1}^ and 
let = (V, E^) be its associated graph. For simplicity, we will consider 

^We allow dimension p = p(n) to grow with the sample size n, however, we usually 
suppress this in our notation 
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binary Markov random fields (MRFs) with pairwise interactions, in which 
the probabihty distribution of takes the form 



where Z{9) is the partition function that ensures that P is a probabihty 
distribution and 9 is the parameter vector. This type of MRF is known as 
the Ising model. 

Let = {X^ P^i*,* I t E T^} be an independent sample. Note that each 
sample X^ is coming from a different distribution P^*,t indexed by 0*'^ and 
that there is an undirected graph — (V^ E^) associated with it. It is going 
to be useful to consider the parameter vector 0*'^ as a (|) -dimensional vector, 
indexed by distinct edges, and non-zero if and only if the edge (u^v) G E^. 
The problem of recovering the structure of the graph is now equivalent 
to estimating non-zero elements of the vector 0*'^. 

Estimation of the parameter vector ^*'^ by maximizing log-likelihood is 
not practically feasible due to the problem of evaluation of the log-partition 
function. One approach to overcoming this problem is to use a surrogate 
likelihood function which can be tractably optimized. However, for an esti- 
mate 6^, obtained through maximization of a surrogate likelihood, one cannot 
guarantee how close 9 and ^*'^ are in general. That said, we have decided 
not to use the approach of Banerjee et al. (2008) to estimate discrete MRFs. 
Instead, we adapt the neighborhood selection procedure of Wainwright et al. 
(2006) to estimate time-varying MRFs. 

We continue with a presentation of the algorithm for the structure recov- 
ery that is based on the neighborhood recovery of each node. Observe that 
recovering the edge set E^ of the graph G^ is equivalent to recovering, for 
each vertex u ^ its neighborhood set J\f{u^t) ^ {v ^ V \ {u^v) ^ E^}. 
The neighborhood set Af{u, t) can be estimated from the sparsity pattern of 
the (p—1) -dimensional subvector of parameters 0*^ := {0*^ | v G V\u} asso- 
ciated with vertex u. Now, under the model (1), the conditional distribution 
of given other variables Xt takes the form 



where (a, 6) = a^h denotes dot product. The log-likelihood, for the node 
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can be written in the following form: 
7(^*;x*) = \og^0t{xi\x') 

(3) 

= x\u) - log (exp((^(„, x{^)) + eM-{0{u, ^J)) • 

In the i.i.d. case, when the graph structure does not change over time, 
Wainwright et al. (2006) have considered an penalized maximum likeli- 
hood approach for estimation of the signed neighborhood using (2). How- 
ever, since we do not have identically distributed sample, we have to modify 
the approach. We are going to consider two types of changes: (a) in which 
the model changes smoothly over time and (b) where the model changes it 
structure at time points in the set T . 



2.1. Smooth changes for discrete MRFs. We start by describing how to 
estimate the neighborhood of the node u. For the point of interest to G [0, 1], 
we define our estimator 9 as the solution to the following minimization 
problem: 

(4) min {/(0;p-) + Aj|0||J 
where 

(5) mv^) = - J2 wMe;x') 

is a weierhted loe^loss, with weierhts defined as Wf = ^^ni't to) 

^hni') — K(-/hn) is a kernel. The parameter A^^ > is a regularization 
parameter specified by user that controls the size of the estimated neighbor- 
hood. The kernel Kh^{-) is a symmetric nonnegative function. 

Let ^ be a minimizer of (4). Based on the vector 0, we construct the 
estimate of the neighborhood: 

(6) AA(ix,to) :={v\ve V\u, 9uv + o} . 

Note that the procedure described above gives estimate only at the time 
point to, so it is usually run for many time points to get insight into the 
structure dynamics. In the rest of this paper, we refer to this approach as 
smooth. 

The optimization problem (4) is the well known objective of i\ penalized 
logistic regression and there are many ways of solving it. For completness, 
we mention few scalable solutions. Koh et al. (2007) develop a speciahzed 
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interior point method for solving £i penalized logistic regression. Using La- 
grangian duality, problem (4) can be written in the equivalent constrained 
form 



l<C(Xn) 



for some constant C(A^) < +oc. Similar to Duchi et al. (2008a), a subgradi- 
ent descent algorithm with projection onto the £i ball (Duchi et al., 2008b) 
solves the problem (7) very efficiently. To obtain results in this paper we use 
a fast coordinate- wise descent method described in Friedman et al. (2008). 
From our limited experience, the specialized first order method works faster 
than the interior point method. 

The coordinate-wise descent method can be described schematicaly as 



1. Set initial values 



2. For z = 1, . . . , p — 1, set the current estimate ^J0'^^^^+^ as a solution to 
the following optimization procedure 
(8) 

+Ai|0| 



mm 



3. Repeat step 2 until convergence 
For efficient way of solving (8) refer to Friedman et al. (2008). 

2.2. Structural changes in discrete MRFs. While most phenomena in 
real life change smoothly, the rate at which we observe them might not reveal 
this smooth change. Indeed, while most edges in a given network evolve 
(appear or disappear) smoothly over time, the sampling rate might give the 
impression that some of these edges change abruptly. This abrupt change 
violates some of the assumptions of the smooth method (see Appendix). In 
this section, we investigate a different approach, which we will call TV, that 
is suitable for recovering abruptly changing networks. 

We propose to estimate the neighborhood of the node u by solving the 
following optimization problem 

(9) = argmin{^ 7(0*;x*) + Ai ^ ||^*||^ + A2 E TV{el)}. 

Here — {(/i, /2, • • • , /p-i)"^ | fi ^ ^-j-n} is a class of functions over 

which we are optimizing and Tj- is class of univariate functions defined on 

imsart-aoas ver. 2008/08/29 file: paper_aoas_2008.tex date: December 30, 2008 



ESTIMATING TIME- VARYING NETWORKS 



9 



[0, 1] that are piecewise constant and right continuous, with discontinuities in 
the set T. It is worth noting that the solution to problem (9) gives estimate 
of the neighborhood for node u for every time and not just for the time 
to as was the case in (4). 

The penalty is structured as a combination of two terms. The first term 
penalizes the complexity of the neighborhood estimate at each time point in 
and the tuning parameter Ai regulates the size of the estimated neigh- 
borhood. The second term penalizes the difference between coefficients that 
are adjacent in time and, as a result, biases the estimate towards functions 
that are "blocky. " This composite penalty was also successfully applied in a 
slightly different setting of linear regression with correlated covariates, where 
it is known as the "fused" lasso penalty (Tibshirani et al., 2005). 

The optimization problem (9) is convex and hence it can be solved effi- 
ciently, however, it is more complex than problem (4) and we do not know of 
any specialized algorithms that can be used to solve it. Here we spend some 
time discussing the optimization procedure for problem (9), which turned 
out to be much more efficient than the existing "off the shelf" solvers. 

We start from the observation that the loss function can be factorized 
as C{6) = h (9i,92, . . . , 9^ + Y^\^i g {0-) for some functions h and g. Tseng 
(2001) established that the block-coordinate descent converges for the loss 
functions with such structure, where h is differentiable and g is convex. 
Based on this observation we propose the following algorithm: 



1. Set initial values 



2. For z = 1, . . . , p — 1, set the current estimate 9?^^^.^^^ as a solution to 
the following optimization procedure 
(10) 

Intern 7 [f^\u,i , • • • , ^ ^ ^ • • • ' ^ 

+\2TV{9') 



mm < 



3. Repeat step 2 until convergence 

Using the proposed block-coordinate descent algorithm, we solve a se- 
quence of optimization problems (10) with n variables, instead of solving 
one big n * (p — 1) optimization problem (9). In practice, this gives us a 
faster algorithm and we are able to solve larger problems. It is still an open 
question how to solve (10) in the most efficient way, however we do not pur- 
sue it here. For the problems considered in this paper, it is sufficient to use 
the optimization package CVX (Grant and Boyd, 2008) to solve for (10). 
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2.3. Multiple observations. In the presentation of the two algorithms, 
we have assummed that at each time point in there is only one obser- 
vation. It can be easily imagined that there are multiple samples at each 
time point, e.g. a controlled repeated micro- array experiment in which two 
samples obtained at certain time points could be regarded as independent 
and identically distributed. Here, we describe how one can easily adapt the 
two procedures to this setting and later, in Section 3, we show how the esti- 
mation procedure benefits from additional observations at each time point. 

There are virtually no modifications needed to accomodate multiple ob- 
servations at different time points in Equation (4). Each additional sample 
will be assigned the same weight through the kernel function Kh^(-). Only 
one small change is needed to modify Equation (9) which becomes 

(11) = argmin{^ ^ l{e';x)+\r ^ ||^*||^+A2 E TV{el)}. 

eeJ='p 1 teT^ xev^^i ter^ vev\u 

The set P^'^ denotes elements from sample observed at time point t. 

2.4. Choosing tuning parameters. Both algorithms for estimating net- 
work with smooth changes (smooth) and for estimating network with struc- 
tural changes (TV) require a choice of tuning parameters. These tuning pa- 
rameters control sparsity of estimated networks and rate at which network 
changes over time. For both methods, the sparsity of estimates is controlled 
with the penalty penaparameter Ai. Rate of change is controlled through 
the bandwidth parameter hn in the method smooth and with the penalty 
parameter A2 in the method TV. Choosing the tuning parameters is of ex- 
treme importance in getting a good estimate that does not overfit the data. 
Large values of the penalty parameter Ai result in a sparse estimate, while 
small values result in a dense model that has higher log-likelihood, but more 
degrees of freedom. The bandwidth parameter hn controls smoothness of es- 
timated networks in smooth method; small values give estimates that change 
often over time, and large values result in estimates that are almost time 
invariant. Intuitively, the bandwidth parameter can be thought of as a size 
of the neighborhood of samples that are used to construct a network at time 
point to. The penalty parameter A2 controls the smoothness in a slightly dif- 
ferent way; large values penalize the difference between parameters adjacent 
in time and bias solution towards a network that is slowly changing, while 
small values allow for more changes in estimates. 

To tune these parameters, one should first note that j{9^;x^) in both 
(9) and (4) represents a logistic regression loss function for each node, thus 
each of (9) and (4) can be regarded as a supervised classification problem 
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where each node u is regressed on the other nodes V\u. Many techniques 
can be used to select the tuning parameters among different candidates. 
When there are enough data, cross-vahdation, or held-out datasets can be 
used, otherwise, the BIC score can be employed. We define the BIC score 
for {e{^,..., 9^:) to be 

(12) 

t=l xGP^'* 

where Dim(-) denotes the dimensionality of the estimated parameters. Simi- 
lar to Tibshirani et al. (2005), we adopt the following definition which counts 
the number of runs of non-zero parameter values 

(13) 

T" 

Dim(9(„, . . . , e{:) = E E ^[sigii(0 7^ sign(C')] x l[sign(^tJ 7^ . 

t=i vev 

Note that the above definition is lenient to perturbations to the value of the 
actual parameters, whereas the definition in Tibshirani et al. (2005) strictly 
increases the dimensionality of the model with these perturbations. Our 
intuition here is that a non-zero parameter corresponds to an edge, thus the 
measure in (13) counts the number of edge changing events including change 
in polarity. The BIC score in (12) can be applied to each node independently 
which allows for nodes to have neighborhoods of different sizes and different 
rates of change for these neighborhoods. 

In the kernel reweighting approach, the tuning of bandwidth parameter 
hn should trade off the smoothness of the network changes and the cov- 
erage of samples used to estimate the network. Using a wider bandwith 
parameter provides more samples to estimate the network, but this risks 
missing sharper changes in the network; using a narrower bandwidth pa- 
rameter makes the estimate more sensitive to sharper changes, but this also 
makes the estimate subject to larger variance due to the reduced effective 
sample size. In this paper, we adopt a heuristic for tuning the inital scale of 
the bandwidth parameter: we set it to be the median of the distance between 
pairs of time points. That is, we first form a matrix (dij) with its entries 
dij •= {ti — tjY {ti^tj ^ T^)- Then the scale of the bandwidth parameter is 
set to the median of the entries in {dij). In our later simulation experiments, 
we find that this heuristic provides a good initial guess for /i^, and it is quite 
close to the value obtained via more exhaustive search. 
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3. Simulation studies. In this section, we perform experiments on 
simulated data to access the performance of the proposed methods. The 
experiments are conducted in a way to give insight on when to use the 
kernel smoothing method, which we denote smooth, and when to use the 
method based on the total variation penalty, which we denote TV. It is also 
instructive to compare these two methods with the neighborhood selection 
method which assumes that the graph is time invariant Wainwright et al. 
(2006), which we denote static. For a given time point to G [0, 1], all three 
methods produce two estimates 9uv and 0yu for each edge parameter. This 
is the artifact of the algorithms in which neighborhoods for nodes u and v 
are estimated separately. There are two ways to symmetrize the solution: 



For each of the three methods, we will denote the solutions obtained through 
different symmetrization as ****.MIN and ****.MAX. 

We simulate our data from a random network, which is generated using 
the following procedure. The network has p nodes that are connected with e 
randomly added edges between pairs of nodes. The edges are added in such 
a way that the maximum degree of a node is d. Next, each edge uv ^ E is 
assigned the potential 0*^ uniformly at random from the interval [^min, ^max]- 
This procedure creates the initial network. In addition, we consider two 
methods for generating time evolving network. The first method generates 
a network that smoothly evolves over time, while the second one generates 
a network that is piecewise constant. 

To simulate the smoothly evolving network, we randomly choose ^change 
edges to delete from the graph and randomly choose ^change edges to add 
to the graph. For the edges to be added to the graph we also assign random 
potentials. The potentials of the edges are gradually decreased to zero or in- 
creased to the randomly assigned value over a period of time.change discrete 
time steps. This process of adding and removing edges is repeated several 
times. At each discrete time step we draw numJid independent samples 
from the network using the Gibbs sampling. Note that using this procedure 
results in networks that almost always have e + ^change edges. 

The piecewise constant network is simulated in a similar fashion as a 
smoothly evolving one. The main difference is that we do not add and re- 
move edges over a period of discrete time steps, instead, the network is kept 
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constant for time-change discrete time steps and then the edges are added 
and removed to the network. 

We evaluate the estimation procedures by estimating the network at 9 
time steps T = {0.1, 0.2, . . . , 0.9}. We express our results in terms of Fl 
score, which is computed as ^p^^ with P denoting precision and R denoting 
recall. Let & denote the estimated edge set in &. Precision is calculated as 
^^^^^^ ^ and recall as The Fl score is a natural choice of the accuracy 

measure as it tries to balance between precision and recall. 

3.1. Simulation of smoothly varying networks. We generate the initial 
random graph with p = 20 nodes and e = 15 edges. Each node of the graph 
has at most 5 = 4 neighbors. Parameters 0uv for each edge is drawn uni- 
formly at random from [0.5, 1]. Next, we randomly add 10 edges to the graph 
and choose 10 edges to remove from the graph. These edges are added and 
removed from the graph over 100 discrete time steps. At each discrete time 
step we generate numJid i.i.d. samples. In our experiments, the parame- 
ter numJid is going to be changed from 1 to 10. We repeat the process of 
adding and removing edges 5 times, so in total we simulate the changing 
network over 500 discrete time steps. On average our network has 25 edges. 
We report results averaged over 20 independent runs. 

The choice of the tuning parameter Ai and the bandwidth hn is of extreme 
importance for the estimation procedure. The number of edges included in 
the graph is controlled with the parameter Ai. As Ai increases, the esti- 
mated graph gets sparser. For the case with large Ai, the estimated graph 
is sparse and we have large precision, however, the recall is very low. On 
the other hand, with small Ai, the estimated graph contains many spurious 
edges and while recall is high, the precision is quite low. A good choice of the 
parameter Ai will balance the two extremes and result in a good Fl score. 
The bandwidth controls the weight given to points in the neighborhood 
of time t for which we want to estimate a graph. The larger the bandwidth, 
the more points that are included in the neighborhood and the resulting 
estimate does not reflect the changes in the network, i.e. the solution is un- 
dersmoothed. If the bandwidth hn is too small, only few points are included 
in the neighborhood and the resulting estimate is very unstable, i.e. the so- 
lution is oversmoothed. A similar role is played by the tuning parameters 
Ai and A2 in the TV procedure. The parameter A2 can be thought of as a 
parameter that controls the number of "blocks" that solution consists of. 
In each of these blocks, the network does not change, and so implicitly the 
parameter A2 controls the size of the neighborhood, similarly to the way 
that the bandwidth parameter hn controls the size of the neighborhood in 
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the smooth procedure. 

Here, in the simulation experiments, we choose the bandwidth parameter 
as explained in Section 2.4. We run the algorithm over a grid of parameters, 
(Ai, hn) for smooth procedure and (Ai, A2) for TV procedure, and choose one 
that optimizes the BIC-type criterion. 




Fig 1. Smoothly changing network. Fl score as a function of Xi. At each time step there 
were numJid = 1 independent samples. The bandwidth parameter was set to hn = 0.182. 
The TV parameter was set to A2 = 0.223. 




Fig 2. Smoothly changing network. Fl score as a function of the number of independent 
samples at each discrete time step numAid. 



We start by reporting results for a fixed bandwidth hn and a fixed pe- 
nalization parameter A2. The results are given in Figure 1 as a function 
of the penalization parameter Ai. The fixed value of tuning parameters is 
somewhat arbitrary as we have used values close to the values picked by a 
BIC-type criterion, however, the aim of these plots is to give intuition on 
how Fl score varies with respect to the tuning parameter Ai. From Figure 1 
we can see that the smooth method performs better than TV, which in turn 
performs better than the static method for most of the range of tuning 
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parameters Ai. The reason for small values of the penalty parameter Ai all 
methods perform alike is that the estimated graph is too dense and using an 
automatic method for picking Ai never chooses values from this region. As 
expected, smooth outperforms TV on the smoothly varying network, mainly 
because TV estimates a solution with many blocks and implicitly, as a result, 
estimates network from a small local neighborhood with few samples. It is 
worth noting that the performance of the static method degrades if the 
simulation is carried over a large number of time steps, which comes with 
no surprise as the static model assumption gets more violated. 

Figure 2 represents Fl score averaged over 20 runs as a function of number 
of independent samples at each time step. Note that the performance of both 
methods smooth and TV increases with the sample size. 

There are some differences between the estimates obtained through MIN 
and MAX symmetrization. In our limited numerical experience, we have seen 
that MAX symmetrization outperforms MIN symmetrization. MIN symmetriza- 
tion is more conservative in including edges to the graph and seems to be 
more susceptible to noise. 

3.2. Simulation of piecewise constant networks. To generate data from 
a piecewise constant network, we generate the initial random graph with 
p = 20 nodes and e = 25 edges, with constraint that each node has at 
most 5 = 4 neighbors. Parameters 0uv for each edge are drawn uniformly at 
random from [0.5, 1]. We generate samples from this graph over 100 discrete 
time steps, at each step drawing numJid i.i.d. samples, where numJid 
ranges from 1 to 10. Next, we randomly remove 10 edges and add 10 edges 
to the graph and repeat the process of drawing samples over 100 discrete 
time steps. The process of changing graph is repeated 5 times, so that at the 
end we have simulated a network over 500 discrete time steps. We report 
results averaged over 20 independent runs. 

Figure 3 gives Fl score as a function of the penalization parameter Ai, 
for fixed values of the other tuning parameters. For the piecewise constant 
network, we can see that the TV method tends to do better than smooth 
method. The reason is that TV estimates jumps relatively accurately and 
the solution is not affected with the structural changes in the model. On 
the other hand, smooth tends to suffer when the network is estimated at a 
point close to a change point since samples from both sides of the change are 
included into a neighborhood. As we increase sample size at each time step. 
Figure 4 shows the performance of the methods. We would like to emphasize 
here that the TV method fits the network at 500 different time points, at every 
discrete time point. However, for the comparison with the smooth method, 
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(a) 



(b) 



Fig 4. Piecewise constant network. Fl score as a function of the number of independent 
samples at each discrete time step numJid. 
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Fig 5. 109th Congress, Connections between Senators in April 2005. 



we only evaluate network topology at time points T = {0.1, 0.2, . . . , 0.9}. 
This decision affects the final result, since the network is constant at most 
of these points, and the two methods seem equivalent. While the TV method 
is able to estimate changes in the network, the smooth method performs 
somewhat unpredictable close to change points. 

4. Experiments. In this section we present the analysis of two real 
data sets using the algorithms presented in Section 2. First, we present the 
analysis of the senate data consisting of Senators' votes on bills during the 
109th Congress. The second data set consists of expression levels of more 
than 4000 genes from the life cycle of Drosophila melanogaster. 

4.1. Senate Voting Records Data. The US senate data consists of voting 
records from 109th congress (2005 - 2006)^. There are 100 senators whose 
votes were recorded on the 542 bills. Each senator corresponds to a variable, 
while the votes are samples recorded as -1 for no and 1 for yes. This data set 
was analyzed in Banerjee et al. (2008), where a static network was estimated. 
Here, we analyze this data set in a time varying framework in order to 
discover how the relationship between senators changes over time. 

This data set has many missing values, corresponding to votes that were 
not cast. We follow the approach of Banerjee et al. (2008) and fill those 
missing values with (-1). Bills we mapped onto a [0, 1] interval, with rep- 
resenting Jan 1st, 2005 and 1 representing Dec 31st, 2006. To estimate the 

^The data can be obtain from the U.S. Senate web page http://www.senate.gov 
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network we have chosen the bandwidth parameter to be hn = 0.174 and the 
penalty parameter Ai = 0.195 for the method smooth, and penalty parame- 
ters Ai = 0.24 and A2 = 0.28 for the method TV. In our figures, square nodes 
represent republicans and circle nodes represent democrats. 

At any time point t, the estimated network contains few clusters of nodes. 
These clusters consist of either Republicans or Democrats connected to each 
others (see Figure 5 ). There are few links connected to different clusters, 
as well as isolated Senators. This observation supports our intuition that 
most senators vote similarly to other members of their party. Links con- 
necting different clusters usually go through senators that are members of 
one party, but have views more similar to the other party, e.g. Senator Ben 
Nelson or Senator Chafee. We do not necessarily need to estimate a time 
evolving network, as these simple observations can be seen from an estimate 
obtained using a procedure that estimates a static network (e.g. Banerjee 
et al. (2008)). 

We are more interested in finding a time evolving pattern. First, we exam- 
ine a neighborhood of the node representing Senators Corzine and Menendez. 
Senator Jon Corzine stepped down from the Senate at the end of the 1st 
Session in the 109th Congress to become the Governor of New Jersey. His 
place in the Senate was filled by Senator Bob Menendez. In our network, we 
have one node representing both senators, so at the end of the 1st Session we 
would expect to see some changes in the network, even though both senators 
are Democrats. Figure 6 presents first neighbors of the node at four different 
time steps, two during each Session. Note that this observation could not be 
seen from a static network. 



Co0iie 





(a) t = 0.1 



(b) t = 0.3 



(c) t = 0.6 



(d) t = OA 



Fig 6. Neighbors of the node that represents Senator Corzine and Senator Menendez at 
four different time points 



Next, we move onto the analysis of the neighbors of Senator Ben Nelson. 
Senator Ben Nelson is a Democrat from Nebraska and is considered as one of 
the most conservative Democrats in the Senate. Figure 7 presents neighbors 
of Senator Ben Nelson of degree two or less at two time points, one during 
the 1st Session and one during the 2nd Session. As a conservative Democrat, 
we expect him to have some Democrat and some Republican neighbors since 
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(a) t = 0.2 (b) t = 0.8 



Fig 7. Neighbors of Senator Ben Nelson (degree two or lower) at the beginning of 109th 
Congress and at the end of 109th Congress 

he shares views with both parties. This observation is supported by Figure 
7(a) which presents his hnks during the 1st Session. It is also interesting 
to note that during the second session, his views drifted more towards the 
Repubhcans (Figure 7(b)). More detailed analysis of the type of bills passed 
during that period of time should be performed to explain this observation. 
We try to explain the observation by noting his political views on abortion, 
where he stands as a pro-life Democrat, and his political views on Iraq, in 
which he stood against withdrawal of most of the combat troops from Iraq. 




(a) t = 0.1 (b) t = 0.4 (c) t = 0.8 



Fig 8. Neighbors of Senator Chafee (degree two or lower) 

We report the results of the method TV in analysis of neighbors of Senator 
Lincoln Chafee. Figure 8 presents neighbors of Senator Chafee at three time 
points during the 109th Congress. We make a tentative observation that 
his neighborhood includes an increasing amount of Democrats as the graph 
is estimated at a later stage during the 109th Congress. Senator Chafee 
left the Republican Party and became an independent in 2007. Also, his 
view on abortion, gay rights and environmental policies are strongly aligned 
with those of Democrats, which supports behavior seen in the estimated 
network. Again, these observations could not be made if a static network 
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was estimated. 

4.2. Gene Regulatory Networks of Drosophila Melanogaster. In this sec- 
tion, we used the kernel reweighting approach to reverse engineer the gene 
regulatory networks of Drosophila melanogaster from a time series of gene 
expression data measured during its full life cycle. Over the developmental 
course of Drosophila melanogaster^ there exist multiple underlying "themes" 
that determine the functionalities of each gene and their relationships to each 
other, and such themes are dynamical and stochastic. As a result, the gene 
regulatory networks at each time point are context-dependent and can un- 
dergo systematic rewiring, rather than being invariant over time. In a semi- 
nal study by Luscombe et al. (2004), it was shown that the "active regulatory 
paths" in the gene regulatory networks of Saccharomyces cerevisiae exhibit 
topological changes and hub transience during a temporal cellular process, 
or in response to diverse stimuli. We expect similar properties can also be 
observed for the gene regulatory networks of Drosophila melanogaster. 

We used microarray gene expression measurements from Arbeitman et al. 
(2002) as our input data. In such an experiment, the expression levels of 
4028 genes are simultaneously measured at various developmental stages. 
Particularly, 66 time points are chosen during the full developmental cycle 
of Drosophila melanogaster, spanning across four different stages, i.e. embry- 
onic (1-30 time point), larval (31-40 time point), pupal (41-58 time points) 
and adult stages (59-66 time points). In this study, we focused on 588 genes 
that are known to be related to developmental process based on their gene 
ontologies. 

Usually, the samples prepared for microarray experiments are a mixture of 
tissues with possibly different expression levels. This means that microarray 
experiments only provide rough estimates of the average expression levels 
of the mixture. Other sources of noise can also be introduced into the mi- 
croarray measurements during, for instance, the stage of hybridization and 
digitization. Therefore, microarray measurements are far from the exact val- 
ues of the expression levels, and it will be more robust if we only consider 
the binary state of the the gene expression: either being up-regulated or 
down- regulated. For this reason, we binarize the gene expression levels into 
{0, 1} (-1 for down- regulated and 1 for up-regulated). We learned a squence 
of binary MRFs from these time series. 

In Figure 9(a), we plotted two different statistics of the reversed engi- 
neered gene regulatory networks as a function of the developmental time 
point (1-66). The first statistic is the network size as measured by the num- 
ber of edges; and the second is the average local clustering coefficient as 
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Fig 9. Characteristic of the dynamic networks estimated for the genes related to develop- 
mental process, (a) Plot of two network statistics as functions of development time line, 
(h) and (c) visualization of two example of networks from different time point. We can see 
that network size can evolve in a very different way from the local clustering coefficient. 



defined by Watts and Strogatz (1998). For comparison, we normalized both 
statistics to the range between [0, 1]. It can be seen that the network size 
and its local clustering coefficient follow very different trajectories during 
the developmental cycle. The network size exhibits a wave structure fea- 
turing two peaks at mid-embryonic stage and the beginning of pupal stage. 
Similar pattern of gene activity has also been observed by Arbeitman et al. 
(2002). In contrast, the clustering coefficients of the dynamic networks drop 
sharply after the mid-embryonic stage, and they stays low until the start 
of the adult stage. One explanation is that at the beginning of the devel- 
opment process, genes have a more fixed and localized function, and they 
mainly interact with other genes with similar functions; however, after mid- 
embryonic stage, genes become more versatile and involved in more diverse 
roles to serve the need of rapid development; as the organism turns into an 
adult, its growth slows down and each gene is restored to its more specialized 
role. To illustrate how the network properties change over time, we visual- 
ized two networks from mid-embryonic stage (time point 15) and mid-pupal 
stage (time point 45) in Figure 9(b) and 9(c) respectively. Although the size 
of the two networks are comparable, we can see that there are much more 
clear local cluster of interacting genes during mid-embryonic stage. 

To judge whether the learned networks make sense biologically, we zoom 
into three groups of genes functionally related to different stages of devel- 
opment process. In particular, the first group (30 genes) is related to em- 
bryonic development based on their functional ontologies; the second group 
(27 genes) is related to post-embryonic development; and the third group 
(25 genes) is related to muscle development. We used interactivity, which 
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Embryonic Development Genes Post Embryonic Development Genes Muscle Development Genes 




10 20 30 40 50 60 70 10 20 30 40 50 60 70 10 20 30 40 50 60 70 
Time Point Time Point Time Point 



(a) (b) (c) 

Fig 10. Interactivity of 3 groups of genes related to (a) embryonic development; (h) post- 
embryonic development and (c) muscle development. The higher the interactivity, the more 
active the group of genes. We see that the interactivity of the three groups is very consistent 
with their functional annotation. 

is the total number of edges a group of genes is connected to, to describe 
the activitiy of each group genes. In Figure 10, we plotted the time courses 
of interactivity for the three groups respectively. For comparison, we nor- 
malize all scores to the range of [0, 1]. We see that the time courses have a 
nice correspondence with their supposed roles. For instance, embryonic de- 
velopment genes have the highest interactivity during embryonic stage, and 
post-embryonic genes increase their interacativity during larval and pupal 
stage. The muscle development genes are less specific to certain developmen- 
tal stages, since they are needed across the developmental cycle. However, 
we see its increased activity when the organism approaches its adult stage 
where muscle development becomes increasingly important. 

The estimated networks also recover many known interactions between 
genes. In recovering these known interactions, the dynamic networks also 
provide additional information as to when interactions occur during devel- 
opment. In Figure 11, we listed these recovered known interactions and the 
precise time when they occur. This also provides a way to check whether 
the learned networks are biologically plausible given the prior knowledge 
of the actual occurence of gene interactions. For instance, the interaction 
between genes msn and dock is related to the regulation of embryonic cell 
shape, correct targeting of photoreceptor axons. This is very consistent with 
the timeline provided by the dynamic networks. A second example is the 
interaction between genes sno and Dl which is related to the development 
of compound eyes of Drosophila. A third example is between genes caps 
and Chi which are related to wing development during pupal stage. What 
is most interesting is that the dynamic networks provide timelines for many 
other gene interactions that have not yet been verified experimentally. This 
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information will be a useful guide for future experiments. 
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Fig 11. Timeline of 45 known gene interactions. Each cell in the plot corresponds to one 
gene pair of gene interaction at one specific time point. The cells in each row are ordered 
according to their time point, ranging from embryonic stage (E) to larval stage (L), to pupal 
stage (P), and to adult stage (A). Cells colored blue indicate the corresponding interaction 
listed in the right column is present in the estimated network; blank color indicates the 
interaction is absent. 

We further studied the relations between 130 transcriptional factors (TF). 
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(a) Summary network 




(b) Time point 15 (mid-embryonic stage) (c) Time point 35 (mid- larval stage) 




(d) Time point 49 (mid-pupal stage) (e) Time point 62 (mid-adult stage) 



Fig 12. The largest transcriptional factors (TF) cascade involving 36 transcriptional fac- 
tors, (a) The summary network is obtained by summing the networks from all time points. 
Each node in the network represents a transcriptional factor, and each edge represents an 
interaction between them. The width of an edge is proportional to the number of the times 
the edge is present during the development; the size of a node is proportional to the sum 
of its edge weights. During different stages of the development, the networks are different, 
(b,c,d,e) shows representative networks for the embryonic, larval, pupal and adult stage of 
the development respectively. 

The network contains several cluster of transcriptional cascades, and we will 
present the detail of the largest transcriptional factor cascade involving 36 
transcriptional factors (Figure 12). This cascade of TFs is functionally very 
coherent, and many TFs in this network play important roles in the ner- 
vous system and eye development. For example, Zn finger homeodomain 1 
(zhfl), brinker (brk), charlatan (chn), decapentaplegic (dpp), invected (inv), 
forkhead box, subgroup (foxo), Optix, eagle (eg), prospero (pros), pointed 
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(pnt), thickveins (tkv), extra macrochaetae (emc), lilliputian (lilli), double- 
sex (dsx) are all involved in nervous and eye development. Besides functional 
coherence, the networks also reveals the dynamic nature of gene regulation: 
some relations are persistent across the full developmental cycle while many 
others are transient and specific to certain stages of development. For in- 
stance, five transcriptional factors, brk-pnt-zfhl-pros-dpp, form a long cas- 
cade of regulatory relations which are active across the full developmental 
cycle. Another example is gene Optix which are active across the full devel- 
opmental cycle and serves as a hub for many other regulatory relations. As 
for transience of the regulatory relations, TFs to the right of Optix hub re- 
duced in their activity as development proceeds to later stage. Furthermore, 
Optix connects two disjoint cascade of gene regulations to its left and right 
side after embryonic stage. 

The dynamic networks also provide an overview of the interactions be- 
tween genes from different functional groups. In Figure 13, we grouped 
genes according to 58 ontologies and visualized the connectivity between 
groups. We can see that large topological changes and network rewiring 
occur between functional groups. Besides expected interactions, the figure 
also reveals many seemingly unexpected interactions. For instance, during 
the transition from pupa stage to adult stage, Drosophila is undergoing a 
huge metamorphosis. One major feature of this metamorphosis is the de- 
velopment of the wing. As can be seen from Figure 13 (r) and 13 (s), genes 
related to metamorphosis, wing margin morphogensis, wing vein morpho- 
genesis and apposition of wing surfaces are among the most active group 
of genes, and they carry their activity into adult stage. Actually, many of 
these genes are also very active during early embryonic stage (for example. 
Figure 13(b) and 13(c)); the different is though they interact with different 
groups of genes. On one hand, the abundance of the transcripts from these 
genes at embryonic stage is likely due to maternal deposit (Arbeitman et al., 
2002); on the other hand, this can also be due to the diverse functionahties 
of these genes. In Table 1, we listed 12 genes related to wing development of 
Drosophila which have a diverse number of other functions. We can see that 
many of these genes also play roles in normal cell growth, cell proliferation 
and embryonic development. 
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(a) Avgerage network. Each color patch denotes an onto- 
logical group, and the position of these ontological groups 
remain the same from (a) to (u). The annotation in the 
outer rim indicates the function of each group. 
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Fig 13. Interactions between gene ontological groups related to developmental process un- 
dergo dynamic rewiring. The weight of an edge between two ontological groups is the total 
number of connection between genes in the two groups. In the visualization, the width of 
an edge is proportional to its edge weight. We thresholded the edge weight at 30 in (b)-(u) 
so that only those interactions exceeding this number are displayed. The average network 
in (a) is produced by averaging the networks underlying (b)-(u). In this case, the threshold 
is set to 20 instead. 
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Table 1 

Genes related to wing development also play roles in normal cell growth and embryonic 
development. Note that the genes in the left column of the table are all related to wing 
development. Hence in the right column of the table we only listed functions other than 

wing development. 



Gene Name 


Other Functions 


dachs (d) 


positive regulation of growth 


moleskin (msk) 


protein import into nucleus, docking, 

cell proliferation, compound eye development 


extra macrochaetae (emc) 


cell proliferation, nervous system development, 
sex determination 


decapentaplegic (dpp) 


anterior/posterior axis specification, cell fate specification, 
compound eye morphogenesis, heart development, 
germ-line stem cell division 


Delta (Dl) 


cell fate specification, compound eye development, 
mesoderm development, oogenesis, stem cell differentiation 
open tracheal system development. 


schnurri (shn) 


negative regulation of cell proliferation, midgut development, 
learning and/or memory 


tolloid (tld) 


embryonic pattern specification, terminal region determination, 
zygotic determination of dorsal/ ventral axis, 
regulation of transforming growth factor 


short stop (shot) 


cell cycle arrest, muscle attachment, oocyte fate determination 


rhea 


cell adhesion, regulation of cell shape, muscle attachment, 
negative regulation of gene-specific transcription 


held out wings (how) 


cell differentiation, embryonic development, 
mesoderm development, somatic muscle development, 
regulation of alternative nuclear mRNA splicing 


blistered (bs) 


cell fate commitment, open tracheal system development 


piopio (pio) 


chitin-based embryonic cuticle biosynthetic process 



5. Some properties of the algorithms. In this section we discuss 
some theoretical guarantees of the proposed algorithms. Mainly, we are in- 
terested in stating sufficient conditions under which the proposed methods 
are able to recover the unknown network correctly, i.e. they are model selec- 
tion consistent. So far, we have managed to find conditions under which the 
kernel smoothing method is model selection consistent. These conditions are 
similar in nature to the conditions established in Ravikumar et al. (2008) 
where a neighborhood selection method was analysed in the context of static 
graphs. 

The main result in this section concerns the scaling of the triple {n^pji^ s^i) 
under which the neighborhood selection procedure based on kernel smooth- 
ing is model selection consistent. The result we present is asymptotic in 
nature, however, we do not provide a classical asymptotic analysis where 
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one has fixed model and the sample size n is allowed to grow, instead, we 
consider the model dimension pn and the maximum node degree Sn to be 
increasing functions of n. This approach gives more insight on the proper- 
ties of the procedure, since it reflects the common situation arising in many 
applications where the number of features is larger than the sample size. 
As the model dimension to increase, we allow the estimation procedure to 
select a model that can represent a larger class of distribution. 

The main quantity that will determine whether our algorithm will succeed 
or fail is the Hessian of the log-likelihood function evaluated at the true 
model parameter ^*^. This quantity is a matrix Qu G M(^^^~^)><(^^^~^) defined 
for each node u eV dis: 

(14) ' ^ 

where r]{x;6) := 4exp(2a;n(6>\^,a;\^)) variance function. This matrix 

(exp(2xn(6'\^,x\^))+l) 

is known as the Fisher information matrix and it plays a similar role in the 
covariance matrix K[X^X^ ] of Gaussian graphical models in characterizing 
the success of the method. We will write Q* for the matrix Q*'* and will 
assume that the conditions hold for every node u E V. We will also drop 
explicit dependence on time when it is clear from context which time point 
we are referrin to. We use S = S^^ := {{u,v)\6l^y / 0} as the index set of 
the edges that are in the neighborhood of the vertex u. will denote its 
complement. Q^g denotes the sub-matrix of Q* indexed by S. 

All Dependency condition The subset of the Fisher information matrix 
corresponding to the relevant features has bounded eigenvalues: there 

is a constant Cmin > such that Amin(Q5'5') ^ Cmin, Vt G [0,1]. 

Furthermore, there is a constant i^max > such that A^ax (5^*) ^ 

i^^max, G [0,1]. 

A2: Incoherence condition There is an incoherence parameter a G (0, 1] 

such that WQ's^siQssy^Woo < VtG [0,1] 

A3: Smoothness conditions Let S*'^ = [o'uvit)]- There exist constants 

Aq > O^A such that max^^^-y sup^ ^ max^^-y sup^ ^ 

A. There also exists a constant Bq > O^B such that max^^^ sup^ ^ 

Bq and max^i^^sup^ \^uv{'^)\ ^ ^• 
A4: The kernel function K{') has a bounded support on [—1, 1] and has the 

following properties 

2 J vK(v)dv <2 J K(v)dv = 1, 2 y v^K{v)dv < 1. 
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The conditions Al and A2 guarantee that the procedure wih be able to 
recover the true structure at each time point t G [0,1]. These conditions 
could be relaxed to hold only for points t ^ T. Note that these conditions 
are the same as for the i.i.d. case analyzed in Ravikumar et al. (2008), when 
the graph is stable over time, and they are related to conditions proposed in 
analysis of the linear regression (e.g. Meinshausen and Biihlmann (2006)). 
The condition A3 describes our notion of the distribution changing smoothly 
over time. 

With these assumptions we are able to state our main result: 

Theorem 1. Assume that Al, A2, A3 and A4 hold. Furthermore, as- 
sume that the following conditions hold: 

1. hn = 0(n"i) 

2. s^h^ = o{l), ^ = o(l) and X, = O(^) 

Then for to G [0, 1]^ the estimated graph G(Ai, /in,^o) obtained through neigh- 
borhood selection satisfies 

(15) P 

for some constants C, C^ 

The theorem tells us that the procedure asymptotically recovers the graph, 
under appropriate regularization parameter A^^ as long as the ambient di- 
mensionality pn and the maximum node degree Sn are not too large, and 
minimum 6 values do not tend to zero too fast. 

Remarks: 

1. The bandwidth parameter is chosen so that it balances variance and 
squared bias of estimation of the elements of the Fisher information 
matrix. 

2. Condition 2 requires that the size of the neighborhood of each node 
remains smaller than the size of the samples. However, the model am- 
bient dimension pn is allowed to grow exponentially in n. 

3. Condition 3 is crucial to be able to distinguish true elements in the 
neighborhood of a node. We require that the size of the minimum 
element of the parameter vector stays bounded away from zero. 

The proof of Theorem 1 is given in the appendix. 
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As of now, we do not have a characterization of the solution given by the 
algorithm TV. The main difficulty in obtaining a theoretical result for this 
procedure seems to be the analysis of the interaction between the £i and 
TV{-) penalty. 

The problem in equation (9) is related to the multiple change point de- 
tection in which one wants to find the "best" partition of {0, ^, ... , 1} 
by intervals so that the unknown signal can be represented as a function of 
parameters that do not change over each interval. Usually one uses a penalty 
based on the number of intervals to select the "best" partition. The opti- 
miziation based on this penalty function is intractable and one can use a TV 
penalty as an alternative. The relationship between use of a penalty based 
on a number of partitions and use of the TV penalty is similar to the rela- 
tionship between the Iq penalty and the ii penalty used in model selection 
for the classical linear regression. 

Another relationship with model selection in the linear regression can be 
seen between the fussed lasso Tibshirani et al. (2005) and the TV penalty. As 
the TV penalty tries to select a partition of the space on which the features 
are defined, the fussed lasso tries to find a partition of the relevant features. 

6. Discussion. We have presented an important problem of structure 
estimation of the time varying networks. While the structure estimation of 
the static networks is an important problem in itself, in certain cases the 
estimated static structure is of limited use. A static structure only shows 
connections and interactions that are persistent throughout the whole time 
period and more refined methods are needed to learn time varying struc- 
tures. In this paper we have presented two methods that learn time varying 
networks from the observed discrete data. Even with these simple proce- 
dures, we are able to discover some patterns that would not be discovered 
using a method that estimates static networks. Being able to estimate more 
complex models comes with a price of more tuning parameters; the band- 
width parameter hn and the penalty parameter A2. 

There are still many ways to improve the methods presented here. More 
principled ways of selecting tuning parameters are definitely needed. Select- 
ing the tuning parameters in neighborhood selection procedure for static 
graphs is not an easy problem and estimating time varying graphs does not 
simplify the problem. Methods presented here do not allow for the incorpo- 
ration of existing knowledge on the network topology into the algorithm. In 
some cases, data is very scarce and we would like to incorporate as much of 
the prior knowledge into the procedure, so directing research towards finding 
Bayesian flavored algorithms seems very important. 
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Two methods presented here present two different ends of the spectrum: 
one algorithm is able to estimate smoothly changing networks, while the 
other is more tailored towards estimation of structural changes in the model. 
It is important to combine the two. There is a great amount of work in 
nonparametric estimation of change points and it would be interesting to try 
to make use of some of those methods for estimating time varying networks. 

Appendix. In this appendix we very briefly outline the main proof idea 
of Theorem 1. The proof itself is quite technical and we plan to present it 
in a separate publication. The main idea is that when the assumptions Al 
to A4 are satisfied, for appropriately chosen tuning parameters hn and Ai, 
the empirical estimate Q of the Fisher information matrix Q* will, with high 
probability, satisfy Al and A2. Conditioned on the event that the estimate Q 
has these "good properties" Al and A2, the structure of the network will be 
estimated correctly, also with high probability. This line of reasoning is not 
new and was employed in Ravikumar et al. (2008) to prove model selection 
consistency of the neighborhood selection method for static networks. The 
time varying networks present a lot of technical issues, however. 

We outline the proof of Theorem 1 for a fixed time point to. Let us define 
a "good" event on which we can show that the probability of estimat- 
ing a wrong neighborhood of node u is converging to 0, i.e. P[A/'(ix,to) ^ 
J\f{u, to)\£] 0. Under the assumptions Al to A4, it can be shown that the 
event £ happens with probability converging to 1, i.e. P [£^] 0. From these 
two parts, we obtain that the neighborhood estimation of each node can be 
achieved exponentially fast, and the main result follows from applying the 
union bound over all Pn nodes. 

Let Q = Qu = ^^wtri{x^]9'''^^)x\^^x\^^ denote the estimate of the Fisher 

information matrix and let = i^u — St ^t^\u^\u denote the estimate of 
the covariance matrix at time point to. The "good" event is defined as 

(16) £ = { T^n \ Q and 11 satisfy assumptions Al and A2 }. 

The two main technical difficulties that we have to deal with are: (1) we 
have to show that the estimate Q converges to Q* for matrices of increasing 
size and (2) we do not have the i.i.d. sample, so we cannot use standard 
large deviation bounds. 

On the event £, we can show that the neighborhood selection procedure 
selects the model consistently analyzing the Karush-Kuhn- Tucker (KKT) 
conditions associated with the convex problem (4). A constructive procedure 
can be given that produces a solution that satisfies the KKT conditions with 
high probability. We have the following result: 
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Lemma 2. Assume that the assumptions Al and A2 hold for the estimate 
Q. If the following conditions hold 



1, hn = 0(n"3) 

2. s^h^ = o{l), ^ ,(1) and A. = OU^ 



3. 



nhn 



then for to G [0, 1]^ the estimated graph G{Xn,tQ) obtained through neighbor- 
hood selection satisfies 

(17) nG{Xn, to) = G'° I S] = 0{exp{-CnhnXl + C logp)) 
for some constants C, C^ 

The remainder of the argument goes by showing that the event £ hap- 
pens with high probabihty. To show this, we bound the deviation between 
elements of Q — [quv{to)] and = [Quv{to)]- We use the fohowing decom- 
position 



(18) 



X^o) - quv{to)\ < 



+ 



+ 



t t 



and proceed with bounding each individual term. The following lemma gives 
us behaviour of the terms in (18). 

Lemma 3. Assume that the conditions of Theorem 1 and A3 are satis- 
fied. Let — wtri{X^\Q''-'^^X\X\. There is a constant C > 0; depending 
on K(-) only, such that for any to G [0, 1].' 

(19) max \quv{to) -^mv{x^'.0''^^)xixl\ = 0{snhn) 

(20) P[| J2{Zl - E[ZU)\ > e] < 2exp(-Cn/i„e2) 

t 

(21) max |E[^^,7/(x^r'0x^.4]-^n.(to)| = 0(M. 

Now, we are ready to state the results that show that the estimate Q 
satisfies Al and A2: 



imsart-aoas ver. 2008/08/29 file: paper_aoas_2008.tex date: December 30, 2008 



ESTIMATING TIME- VARYING NETWORKS 



33 



Lemma 4. Under assumptions Al and A3 and any to G [0, 1]^ we have 



(22) AminiQss) > C^in " OM ^IL^ + s^h^). 

V ^/^n 

Lemma 5. Under assumptions A2, A3, for any to G [0, 1]^ we have 

(23) ||te(Qss)-'||oo<l-| 
with probability 1 - (!?(exp(-C^ + C'logp)). 

Similarly we can show that the estimate S satisfies the assumption AL 
We omit the details. 

Putting Lemmas 2, 3, 4, 5 together we prove Theorem 1. 
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